21 research outputs found

    Discovery and application of data dependencies

    Advisor: Prof. Dr. Eduardo Cunha de Almeida. Thesis (doctorate) - Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 08/09/2020. Includes references: p. 126-140. Area of concentration: Computer Science.
    Abstract: Data dependencies (or dependencies, for short) have a fundamental role in many facets of data management. As a result, recent research has been continually driving contributions to central problems in connection with dependencies. This thesis makes contributions that reach two of these problems. The first problem regards the discovery of dependencies of high expressive power. The goal is to replace the error-prone process of manual design of dependencies with an algorithm capable of discovering dependencies using only data. In this thesis, we study the discovery of denial constraints, a type of dependency that circumvents many expressiveness drawbacks. Denial constraints have enough expressive power to generalize other important types of dependencies and to express complex business rules. However, their discovery is computationally hard, since it involves a search space that is larger than the one seen in the discovery of simpler dependencies. This thesis introduces novel algorithmic techniques in the form of an algorithm for the discovery of denial constraints. We evaluate the design of our algorithm in a variety of scenarios: real and synthetic datasets, and varying numbers of records and columns. Our evaluation shows that, compared to state-of-the-art solutions, our algorithm significantly improves the efficiency of denial constraint discovery in terms of runtime. The second problem concerns the application of dependencies in data management. We first study the application of dependencies for improving data consistency, a critical aspect of data quality. A common way to model data inconsistencies is by identifying violations of dependencies. In that context, this thesis presents a method that extends our algorithm for the discovery of denial constraints such that it can return reliable results even if the algorithm runs on data containing some inconsistent records. A central insight is that it is possible to extract evidence from datasets to discover denial constraints that almost hold in the dataset. Our evaluation shows that our method returns denial constraints that can identify, with good precision and recall, inconsistencies in the input dataset. This thesis makes one more contribution regarding the application of dependencies for improving data consistency. It presents a system for detecting violations of dependencies efficiently. We perform an extensive evaluation of our system that includes comparisons with several different approaches; real-world and synthetic data; and various kinds of denial constraints. We show that the tested commercial database management systems start underperforming for relatively small datasets and production dependencies in the form of denial constraints. Our system, in turn, is up to three orders of magnitude faster than related solutions, especially for larger datasets and massive numbers of identified violations. Our final contribution regards the application of dependencies in query optimization. In particular, this thesis presents a system for the automatic discovery and selection of functional dependencies that potentially improve query executions. Our system combines representations from the functional dependencies discovered in a dataset with representations of the query workloads that run for that dataset. This combination guides the selection of functional dependencies that can produce query rewritings for the incoming queries. Our experimental evaluation shows that our system selects relevant functional dependencies, which can help in reducing the overall query response time.
    Keywords: Data profiling. Data quality. Data cleaning. Data dependencies. Query execution.
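
    To make the notion of a denial-constraint violation concrete, here is a minimal sketch of a naive pairwise check. It is not the thesis's detection system, which the abstract describes only at a high level. The function name `violations`, the predicate encoding, and the toy `zip`/`state` records are all assumptions made for illustration.

```python
import operator
from itertools import permutations

# Comparison operators allowed in a predicate (encoding assumed for this sketch).
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def violations(rows, predicates):
    """Return ordered index pairs (i, j) of rows that satisfy every predicate.

    A denial constraint forbids any pair of records for which all of its
    predicates hold at once, so each returned pair is a violation.
    Each predicate is (column_of_t1, operator, column_of_t2).
    """
    found = []
    # Naive O(n^2) scan over ordered pairs of distinct rows.
    for (i, t1), (j, t2) in permutations(enumerate(rows), 2):
        if all(OPS[op](t1[a], t2[b]) for a, op, b in predicates):
            found.append((i, j))
    return found

# Hypothetical toy data: records with the same zip code must share a state.
rows = [
    {"zip": "80000", "state": "PR"},
    {"zip": "80000", "state": "SP"},  # conflicts with row 0
    {"zip": "01000", "state": "SP"},
]
# Encodes: not (t1.zip = t2.zip and t1.state != t2.state)
dc = [("zip", "=", "zip"), ("state", "!=", "state")]
print(violations(rows, dc))
```

    The quadratic scan is only meant to show what counts as a violation; it would not scale to the dataset sizes the abstract mentions.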

    Mapping density, diversity and species-richness of the Amazon tree flora

    Using 2,046 botanically-inventoried tree plots across the largest tropical forest on Earth, we mapped tree species-diversity and tree species-richness at 0.1-degree resolution, and investigated drivers for diversity and richness. Using only location, stratified by forest type, as a predictor, our spatial model, to the best of our knowledge, provides the most accurate map of tree diversity in Amazonia to date, explaining approximately 70% of the tree diversity and species-richness. Large soil-forest combinations determine a significant percentage of the variation in tree species-richness and tree alpha-diversity in Amazonian forest-plots. We suggest that the size and fragmentation of these systems drive their large-scale diversity patterns and hence local diversity. A model not using location but cumulative water deficit, tree density, and temperature seasonality explains 47% of the tree species-richness in the terra-firme forest in Amazonia. Over large areas across Amazonia, residuals of this relationship are small and poorly spatially structured, suggesting that much of the residual variation may be local. The Guyana Shield area has consistently negative residuals, showing that this area has lower tree species-richness than expected by our models. We provide extensive plot meta-data, including tree density, tree alpha-diversity and tree species-richness results and gridded maps at 0.1-degree resolution.

    Consistent patterns of common species across tropical tree communities

    Trees structure the Earth’s most biodiverse ecosystem, tropical forests. The vast number of tree species presents a formidable challenge to understanding these forests, including their response to environmental change, as very little is known about most tropical tree species. A focus on the common species may circumvent this challenge. Here we investigate abundance patterns of common tree species using inventory data on 1,003,805 trees with trunk diameters of at least 10 cm across 1,568 locations [1-6] in closed-canopy, structurally intact old-growth tropical forests in Africa, Amazonia and Southeast Asia. We estimate that 2.2%, 2.2% and 2.3% of species comprise 50% of the tropical trees in these regions, respectively. Extrapolating across all closed-canopy tropical forests, we estimate that just 1,053 species comprise half of Earth’s 800 billion tropical trees with trunk diameters of at least 10 cm. Despite differing biogeographic, climatic and anthropogenic histories [7], we find notably consistent patterns of common species and species abundance distributions across the continents. This suggests that fundamental mechanisms of tree community assembly may apply to all tropical forests. Resampling analyses show that the most common species are likely to belong to a manageable list of known species, enabling targeted efforts to understand their ecology. Although they do not detract from the importance of rare species, our results open new opportunities to understand the world’s most diverse forests, including modelling their response to environmental change, by focusing on the common species that constitute the majority of their trees. Publisher PDF. Peer reviewed.
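
    The statement that a small share of species comprises 50% of the trees can be illustrated with a simple count over a sample. This is a minimal sketch on made-up data, not the paper's estimation procedure, which extrapolates beyond the sampled plots. The function name `species_for_half` and the toy species labels are assumptions.

```python
from collections import Counter

def species_for_half(species_per_tree):
    """Return (k, share): the smallest number k of most-abundant species
    that together account for at least half of the trees, and k as a
    share of all species observed in the sample."""
    counts = Counter(species_per_tree)
    total = sum(counts.values())
    running = 0
    # Walk species from most to least abundant, accumulating tree counts.
    for k, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if 2 * running >= total:
            return k, k / len(counts)
    return 0, 0.0  # empty sample

# Hypothetical sample: one species label per inventoried tree.
trees = ["a"] * 6 + ["b"] * 3 + ["c"] * 2 + ["d"]
print(species_for_half(trees))
```

    On this toy sample a single species out of four already holds half of the twelve trees, which mirrors the skewed abundance pattern the abstract reports at much larger scale.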

    Rarity of monodominance in hyperdiverse Amazonian forests.

    Tropical forests are known for their high diversity. Yet, forest patches do occur in the tropics where a single tree species is dominant. Such "monodominant" forests are known from all of the main tropical regions. For Amazonia, we sampled the occurrence of monodominance in a massive, basin-wide database of forest-inventory plots from the Amazon Tree Diversity Network (ATDN). Utilizing a simple defining metric of at least half of the trees ≥ 10 cm diameter belonging to one species, we found only a few occurrences of monodominance in Amazonia, and the phenomenon was not significantly linked to previously hypothesized life history traits such as wood density, seed mass, ectomycorrhizal associations, or Rhizobium nodulation. In our analysis, coppicing (the formation of sprouts at the base of the tree or on roots) was the only trait significantly linked to monodominance. While at specific locales coppicing or ectomycorrhizal associations may confer a considerable advantage to a tree species and lead to its monodominance, very few species have these traits. Mining of the ATDN dataset suggests that monodominance is quite rare in Amazonia, and may be linked primarily to edaphic factors.
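
    The defining metric quoted above (at least half of the trees ≥ 10 cm diameter belonging to one species) can be sketched as a per-plot check. This is an illustrative reading of that sentence only; the function name `monodominant_species`, the `(species, diameter_cm)` record layout, and the toy plots are assumptions, not the ATDN data format.

```python
from collections import Counter

def monodominant_species(trees, min_dbh_cm=10.0, threshold=0.5):
    """Return the species holding at least `threshold` of the stems with
    diameter >= `min_dbh_cm`, or None if no species reaches that share.

    `trees` is an iterable of (species, diameter_cm) pairs for one plot.
    """
    # Keep only stems that meet the diameter cutoff.
    eligible = [species for species, dbh in trees if dbh >= min_dbh_cm]
    if not eligible:
        return None
    species, n = Counter(eligible).most_common(1)[0]
    return species if n / len(eligible) >= threshold else None

# Hypothetical plots.
plot_a = [("A", 12.0), ("A", 30.5), ("A", 15.0),
          ("B", 11.0), ("C", 8.0), ("B", 22.0)]
plot_b = [("A", 12.0), ("B", 12.0), ("C", 12.0)]
print(monodominant_species(plot_a))
print(monodominant_species(plot_b))
```

    In `plot_a` the 8 cm stem is ignored and species "A" holds three of the five remaining stems, so the plot qualifies; in `plot_b` no species reaches half.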

    Unraveling Amazon tree community assembly using Maximum Information Entropy: a quantitative analysis of tropical forest ecology

    In a time of rapid global change, the question of what determines patterns in species abundance distribution remains a priority for understanding the complex dynamics of ecosystems. The constrained maximization of information entropy provides a framework for understanding such complex system dynamics through a quantitative analysis of important constraints, via predictions using least biased probability distributions. We apply it to over two thousand hectares of Amazonian tree inventories across seven forest types and thirteen functional traits, representing major global axes of plant strategies. Results show that constraints formed by regional relative abundances of genera explain eight times more of local relative abundances than constraints based on directional selection for specific functional traits, although the latter do show clear signals of environmental dependency. These results provide quantitative insight by inference from large-scale data using cross-disciplinary methods, furthering our understanding of ecological dynamics.

    NEOTROPICAL ALIEN MAMMALS: a data set of occurrence and abundance of alien mammals in the Neotropics

    Biological invasion is one of the main threats to native biodiversity. For a species to become invasive, it must be voluntarily or involuntarily introduced by humans into a nonnative habitat. Mammals were among the first taxa to be introduced worldwide for game, meat, and labor, yet the number of species introduced in the Neotropics remains unknown. In this data set, we make available occurrence and abundance data on mammal species that (1) transposed a geographical barrier and (2) were voluntarily or involuntarily introduced by humans into the Neotropics. Our data set is composed of 73,738 historical and current georeferenced records on alien mammal species, of which around 96% correspond to occurrence data on 77 species belonging to eight orders and 26 families. Data cover 26 continental countries in the Neotropics, ranging from Mexico and its frontier regions (southern Florida and coastal-central Florida in the southeast United States) to Argentina, Paraguay, Chile, and Uruguay, and the 13 countries of Caribbean islands. Our data set also includes neotropical species (e.g., Callithrix sp., Myocastor coypus, Nasua nasua) considered alien in particular areas of the Neotropics. The most numerous species in terms of records are from Bos sp. (n = 37,782), Sus scrofa (n = 6,730), and Canis familiaris (n = 10,084); 17 species were represented by only one record (e.g., Syncerus caffer, Cervus timorensis, Cervus unicolor, Canis latrans). Primates have the highest number of species in the data set (n = 20 species), partly because of uncertainties regarding taxonomic identification of the genus Callithrix, which includes the species Callithrix aurita, Callithrix flaviceps, Callithrix geoffroyi, Callithrix jacchus, Callithrix kuhlii, Callithrix penicillata, and their hybrids. This unique data set will be a valuable source of information on invasion risk assessments, biodiversity redistribution and conservation-related research. There are no copyright restrictions. Please cite this data paper when using the data in publications. We also request that researchers and teachers inform us on how they are using the data.